Abstract

We build a general and highly applicable clustering theory, which we call
cross-entropy clustering (CEC for short), which joins the advantages of
classical k-means (easy implementation and speed) with those of EM (affine
invariance and the ability to adapt to clusters of desired shapes). Moreover,
contrary to k-means and EM, CEC finds the optimal number of clusters by
automatically removing groups which carry no information.

Although CEC, like EM, can be built on an arbitrary family
of densities, in the most important case of Gaussian CEC the division into
clusters is affine invariant, while the numerical complexity is comparable to
that of k-means.

Keywords: clustering, cross-entropy, memory compression 



1. Introduction 

1.1. Motivation 

As is well known, clustering plays a basic role in many parts of data
engineering, pattern recognition and image analysis [1, 2, 3, 4, 5]. Thus it
is not surprising that there are many methods of data clustering, many of
which however inherit the deficiencies of the first method, called k-means
[6, 7]. Since k-means has the tendency to divide the data into spherical
shaped clusters of similar sizes, it is not affine invariant and does not deal
well with clusters of various sizes. This causes the so-called mouse effect, see
Figure 1(b). Moreover, it does not find the right number of clusters, see
Figure 1(c), and consequently to apply it we usually need to use additional
tools like gap statistics [8, 9]. Since k-means has so many disadvantages, one
can ask why it is so popular. One of the possible answers lies in the fact that
k-means is simple to implement and very fast compared to more advanced
clustering methods like EM and classification EM [10, 11].

Figure 1: Clustering of the uniform density on the mouse-like set (Fig. 1(a)) by the standard
k-means algorithm with k = 3 (Fig. 1(b)) and k = 10 (Fig. 1(c)), compared with Spherical
CEC (Fig. 1(d)) with initially 10 clusters (finished with 3). Panels: (a) Mouse-like set;
(b) k-means with k = 3; (c) k-means with k = 10; (d) Spherical CEC.

So let us now discuss EM, the approach at the other end of the clustering
spectrum. It is based on a family of densities $\mathcal{F}$, convex combinations of
which we allow to estimate the density of the data-set we study. By modifying
$\mathcal{F}$ we can adapt our method to the search for clusters of various types [12].
The disadvantages follow from the fact that EM is relatively slow and not
well adapted to dealing with large data-sets.¹ Let us also add that EM, like
k-means, does not find the right number of clusters.

In our paper we construct a general cross-entropy clustering (CEC) theory
which simultaneously joins, and even improves on, the clustering advantages of
classical k-means and EM. The aim of this paper is to study the theoretical
background of cross-entropy clustering. Due to its length we decided to
illustrate it only on basic examples.²

¹ The disadvantages of common clustering methods are excellently summarized in the
third paragraph of [13]: "[...] The weaknesses of k-MEANS result in poor quality
clustering, and thus, more statistically sophisticated alternatives have been proposed. [...] While
these alternatives offer more statistical accuracy, robustness and less bias, they trade this
for substantially more computational requirements and more detailed prior knowledge

Figure 2: Gaussian CEC starting from 10 initial clusters. Panels: (a) Four Gaussians;
(b) 4 clusters; (c) Bear-like set; (d) 6 clusters.

1.2. Main idea 

We base CEC on the observation that it is often profitable to use various
compression algorithms specialized in different data types. We apply this
observation in reverse, namely we group/cluster together those data which are
compressed by one algorithm from the preselected set of compressing
algorithms.³ In the development of this idea we were influenced by the classical
Shannon Entropy Theory [16, 17, 18, 19] and the Minimum Description Length
Principle [20, 21]. In particular we were strongly inspired by the application
of MDLP to image segmentation given in [22, 23].

From the theoretical point of view our basic idea lies in applying cross-entropy
to many "compressing" densities. Its greatest advantage is the automatic
reduction of unnecessary clusters: contrary to the case of classical k-means
or EM, there is a memory cost of using each cluster. Consequently, from the
cross-entropy clustering point of view it is in many cases profitable to decrease
the number of used clusters.

Example 1.1. To visualize our theory let us look at the results of Gaussian
CEC given in Figure 2. In both cases we started with k = 10 initial randomly
chosen clusters which were reduced automatically by the algorithm.

² For a sample application of CEC in classification and recognition of elliptic shapes
we refer the reader to [15].

³ We identify a coding/compressing algorithm with a subdensity, see the next section
for detailed explanations.
In practical implementations our approach can be viewed as a generalized
and "modified" version of the classical k-means clustering. As a consequence
the complexity of CEC is usually that of k-means, and one can easily
adapt most ideas used in various versions of k-means to CEC.

Since CEC is in many aspects influenced by EM let us briefly summarize
the main similarities and differences. Suppose that we are given a continuous
probability measure $\mu$ (which represents our data) with density $f_\mu$ and fixed
densities $f_1, \ldots, f_k$ by a combination of which we want to approximate $f_\mu$.

• The basic goal of EM is to find probabilities $p_1, \ldots, p_k$ such that the
approximation

$$f_\mu \approx p_1 f_1 + \ldots + p_k f_k \qquad (1.1)$$

is optimal.

• In CEC we search for a partition of $\mathbb{R}^N$ into (possibly empty) pairwise
disjoint sets $U_1, \ldots, U_k$ and probabilities $p_1, \ldots, p_k$ such that the
approximation

$$f_\mu \approx p_1 f_1|_{U_1} \cup \ldots \cup p_k f_k|_{U_k} \qquad (1.2)$$

is optimal.

Observe that as a result of CEC we naturally obtain the partition of the
space into sets $(U_i)_{i=1}^k$. Another crucial consequence of the formula (1.2) is
that, contrary to the earlier approaches based on MLE, we approximate $f_\mu$
not by a density, as is the case in (1.1), but by a subdensity.⁴



1.3. Contents of the paper 

For the convenience of the reader we now briefly summarize the contents
of the article. In the following section we discuss (mostly) known results
concerning entropy which we will need in our study. In particular we identify
the subdensities with coding/compressing algorithms. In the third section
we provide a detailed motivation and explanation of our basic idea, which
allows us to interpret the cross-entropy for the case of many "coding densities".
More precisely, given a partition $U_1, \ldots, U_k$ of $\mathbb{R}^N$ and subdensity families
$\mathcal{F}_1, \ldots, \mathcal{F}_k$, we introduce the subdensity family $\biguplus_{i=1}^k (\mathcal{F}_i|U_i)$, which consists of
those acceptable codings in which elements of $U_i$ are compressed by a fixed
element from $\mathcal{F}_i$. We also show how to apply the classical Lloyd's and Hartigan's
approaches to cross-entropy minimization.

⁴ By a subdensity we understand a measurable nonnegative function with integral not
greater than one.

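The last section contains applications of our theory to clustering. We first
consider a general idea of cross-entropy $\mathcal{F}$-clustering, whose aim is to find a
$\mu$-partition $(U_i)_{i=1}^k$ minimizing

$$H^\times\Big(\mu \,\Big\|\, \biguplus_{i=1}^k (\mathcal{F}|U_i)\Big).$$

This allows us to investigate the $\mathcal{F}$-divergence of the $\mu$-partition $(U_i)_{i=1}^k$

$$d_\mu(\mathcal{F}; (U_i)_{i=1}^k) := H^\times(\mu\|\mathcal{F}) - H^\times\Big(\mu \,\Big\|\, \biguplus_{i=1}^k (\mathcal{F}|U_i)\Big),$$

which measures the validity of the clustering $(U_i)_{i=1}^k$.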

Next we proceed to the study of clustering with respect to various Gaussian
subfamilies. First we investigate the most important case of Gaussian
CEC and show that it reduces to the search for the partition $(U_i)_{i=1}^k$ of the
given data-set $U$ for which the value of

$$\sum_{i=1}^k p(U_i) \cdot \big[-\ln(p(U_i)) + \tfrac{1}{2}\ln\det(\Sigma_{U_i})\big]$$

is minimal, where $p(V) = \mathrm{card}(V)/\mathrm{card}(U)$ and $\Sigma_V$ denotes the covariance
matrix of the set $V$. It turns out that the Gaussian clustering is affine invariant.

Then we study clustering based on the Spherical Gaussians, that is those
with covariance proportional to identity. Comparing Spherical CEC to classical
k-means we obtain that the clustering is scale and translation invariant
and that clusters do not tend to be of fixed size. Consequently we do not obtain
the mouse effect.⁵

Example 1.2. Let us observe in Figure 1 the comparison of Spherical CEC
with classical k-means on the Mickey-Mouse-like set. We see that Spherical
CEC was able to find the "right" number of clusters, and that the clusters
have "reasonable" shapes.

⁵ Let us add that in [25] the authors present a numerical modification of k-means to
allow dealing with spherical shaped clusters of various size.
To apply Spherical clustering we need the same information as in the
classical k-means: in the case of k-means we seek the splitting of the data
$U \subset \mathbb{R}^N$ into $k$ sets $(U_i)_{i=1}^k$ such that the value of $\sum_{i=1}^k p(U_i) \cdot D_{U_i}$ is minimal,
where $D_V = \frac{1}{\mathrm{card}(V)} \sum_{v \in V} \|v - m_V\|^2$ denotes the mean within cluster $V$ sum
of squares (and $m_V$ is the mean of $V$). It turns out that the Gaussian spherical
clustering in $\mathbb{R}^N$ reduces to minimization of

$$\sum_{i=1}^k p(U_i) \cdot \big[-\ln(p(U_i)) + \tfrac{N}{2}\ln D_{U_i}\big].$$

Next we proceed to the study of clustering by Gaussians with fixed covariance.
We show that in the case of bounded data the optimal number of
clusters is bounded above by the maximal cardinality of the respective $\varepsilon$-net in
the convex hull of the data. We finish our paper with the study of clustering
by Gaussian densities with covariance equal to $sI$ and prove that with $s$
converging to zero we obtain the classical k-means clustering, while with $s$
growing to $\infty$ the data will form one big group.

2. Cross-entropy 

2.1. Compression and cross-entropy

Since CEC is based on choosing the optimal (from the memory point of 
view) coding algorithms, we first establish notation and present the basics of 
cross-entropy compression. 

Assume that we are given a discrete probability distribution $\nu$ on a finite
set $X = \{x_1, \ldots, x_k\}$ which attains the values $x_i$ with probabilities $f_i$. Then,
roughly speaking [18], the optimal code-lengths⁶ in the case we use a coding
alphabet consisting of $d$ symbols to code $\nu$ are given by $l_i = -\log_d f_i$, and
consequently the expected code length is given by the entropy

$$h_d(\nu) := \sum_{i=1}^k f_i l_i = \sum_{i=1}^k f_i \cdot (-\log_d f_i) = \frac{1}{\ln d} \sum_{i=1}^k \mathrm{sh}(f_i),$$

where $\mathrm{sh}(x)$ denotes the Shannon function defined by $-x \cdot \ln x$ if $x > 0$ and
$\mathrm{sh}(0) := 0$. We recall that for arbitrary code-lengths $l_i$ to be acceptable they
have to satisfy Kraft's inequality $\sum_{i=1}^k d^{-l_i} \le 1$.

⁶ We accept arbitrary, not only integer, code-lengths.

If we code a discrete probability measure $\mu$ (which attains $x_i$ with probability
$g_i$) by the code optimized for a subprobabilistic measure $\nu$ we arrive at
the definition of cross-entropy $h_d^\times(\mu\|\nu)$, which is given as the expected code
length

$$h_d^\times(\mu\|\nu) := \sum_{i=1}^k g_i \cdot (-\log_d f_i).$$
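The discrete notions above can be checked on a toy distribution. The following is a minimal sketch, assuming a binary alphabet (d = 2) and two hypothetical distributions `f` and `g` chosen only for illustration; the function names are ours and do not come from the paper.

```python
import math

def code_lengths(probs, d=2):
    """Optimal (not necessarily integer) code lengths l_i = -log_d f_i."""
    return [-math.log(f, d) for f in probs]

def kraft_sum(lengths, d=2):
    """Left-hand side of Kraft's inequality: sum of d^(-l_i)."""
    return sum(d ** (-l) for l in lengths)

def entropy(probs, d=2):
    """Expected code length when nu is coded by its own optimal code."""
    return sum(-f * math.log(f, d) for f in probs if f > 0)

def cross_entropy(g, f, d=2):
    """Expected code length when data distributed as g is coded by the code for f."""
    return sum(-gi * math.log(fi, d) for gi, fi in zip(g, f) if gi > 0)

f = [0.5, 0.25, 0.25]   # illustrative distribution the code is optimized for
g = [0.25, 0.25, 0.5]   # illustrative distribution of the data actually coded

lengths = code_lengths(f)
print(lengths, kraft_sum(lengths))
print(entropy(f), entropy(g), cross_entropy(g, f))
```

Coding `g` with the code designed for `f` costs more on average than coding `g` with its own code, which is the discrete version of the lower bound used later for subdensities.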

Let us proceed to the case of a continuous probability measure $\nu$ on $\mathbb{R}^N$
(with density $f_\nu$). The role of entropy is played by differential entropy (which
corresponds to the limiting value of discrete entropy of coding with quantization
error going to zero [16]):

$$h_d(\nu) := \int f_\nu(x) \cdot (-\log_d f_\nu(x))\,dx = \frac{1}{\ln d}\int \mathrm{sh}(f_\nu(x))\,dx,$$

where $f_\nu$ denotes the density of the measure $\nu$. In fact, as was the case for
discrete spaces, we will need to consider codings produced by subprobability
measures.

Definition 2.1. We call a nonnegative measurable function $f : \mathbb{R}^N \to \mathbb{R}_+$ a
subdensity if $\int_{\mathbb{R}^N} f(x)\,dx \le 1$.

Thus the differential code-length connected with subdensity $f$ is given by

$$l(x) = -\log_d f(x). \qquad (2.1)$$

Dually, an arbitrary measurable function $x \to l(x)$ is acceptable as a
"differential coding length" if it satisfies the differential version of Kraft's
inequality:

$$\int_{\mathbb{R}^N} d^{-l(x)}\,dx \le 1, \qquad (2.2)$$

which is equivalent to saying that the function $f(x) := d^{-l(x)}$ is a subdensity.

From now on, if not otherwise specified, by $\mu$ we denote either a continuous
or a discrete probability measure on $\mathbb{R}^N$.
Definition 2.2. We define the cross-entropy of $\mu$ with respect to subdensity
$f$ by:

$$H_d^\times(\mu\|f) := \int -\log_d f(y)\,d\mu(y). \qquad (2.3)$$

It is well known that if $\mu$ has density $f_\mu$, the minimum in the above
integral over all subdensities is obtained for $f = f_\mu$ (and consequently the
cross-entropy is bounded from below by the differential entropy).

One can easily get the following: 

Observation 2.1. Let $f$ be a given subdensity and $A$ an invertible affine
operation. Then

$$H_d^\times(\mu \circ A^{-1}\|f_A) = H_d^\times(\mu\|f) + \log_d|\det A|,$$

where $f_A$ is a subdensity defined by

$$f_A : x \to f(A^{-1}x)/|\det A|, \qquad (2.4)$$

and $\det A$ denotes the determinant of the linear component of $A$.

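In our investigations we will be interested in (optimal) coding for $\mu$ by
elements of a set of subdensities $\mathcal{F}$, and therefore we put

$$H_d^\times(\mu\|\mathcal{F}) := \inf\{H_d^\times(\mu\|f) : f \in \mathcal{F}\}.$$

One can easily check that if $\mathcal{F}$ consists of densities then the search for
$H_d^\times(\mu\|\mathcal{F})$ reduces to the maximum likelihood estimation of the measure $\mu$ by
the family $\mathcal{F}$. Thus by $\mathrm{MLE}(\mu\|\mathcal{F})$ we will denote the set of all subdensities
from $\mathcal{F}$ which realize the infimum:

$$\mathrm{MLE}(\mu\|\mathcal{F}) := \{f \in \mathcal{F} : H_d^\times(\mu\|f) = H_d^\times(\mu\|\mathcal{F})\}.$$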

In proving that the clustering is invariant with respect to the affine
transformation $A$ we will use the following simple corollary of Observation 2.1.

Corollary 2.1. Let $\mathcal{F}$ be a subdensity family and $A : \mathbb{R}^N \to \mathbb{R}^N$ an invertible
affine operation. By $\mathcal{F}_A$ we denote $\{f_A : f \in \mathcal{F}\}$, where $f_A$ is defined by
(2.4). Then

$$H_d^\times(\mu \circ A^{-1}\|\mathcal{F}_A) = H_d^\times(\mu\|\mathcal{F}) + \log_d|\det A|. \qquad (2.5)$$

As we know, entropy in the case when we code with $d$ symbols is a rescaling
of entropy computed in nats, which can be symbolically written as
$H_d^\times(\cdot) = \frac{1}{\ln d} H_e^\times(\cdot)$. Consequently, for brevity, when we compute the
entropy in nats we will omit the subscript $e$ in the entropy symbol:

$$H^\times(\cdot) := H_e^\times(\cdot).$$
2.2. Cross-entropy for Gaussian families

By $m_\mu$ and $\Sigma_\mu$ we denote the mean and covariance of the measure $\mu$, that
is

$$m_\mu := \int x\,d\mu(x), \qquad \Sigma_\mu := \int (x - m_\mu)(x - m_\mu)^T\,d\mu(x).$$

For a measure $\mu$ and a measurable set $U$ such that $0 < \mu(U) < \infty$ we introduce
the probability measure $\mu_U$

$$\mu_U(A) := \frac{\mu(A \cap U)}{\mu(U)},$$

and use the abbreviations

$$m_U^\mu := m_{\mu_U}, \qquad \Sigma_U^\mu := \Sigma_{\mu_U}.$$

The basic role in Gaussian cross-entropy minimization is played by the
following result, which says that we can reduce the computation to Gaussian
families. Since its proof is essentially a known part of MLE, we provide here only
its short idea.

Given symmetric positive matrix S, we recall that by the Mahalanobis 
distance ESI I2E] we understand 



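$$\|x - y\|_\Sigma^2 := (x - y)^T \Sigma^{-1} (x - y).$$

By $\mathcal{N}(m, \Sigma)$ we denote the normal density with mean $m$ and covariance $\Sigma$,
which as we recall is given by the formula

$$\mathcal{N}(m, \Sigma)(x) := \frac{1}{\sqrt{(2\pi)^N \det(\Sigma)}} \exp\big(-\tfrac{1}{2}\|x - m\|_\Sigma^2\big).$$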

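Theorem 2.1. Let $\mu$ be a discrete or continuous probability measure with
well-defined covariance matrix, and let $m \in \mathbb{R}^N$ and a positive-definite
symmetric matrix $\Sigma$ be given.
Then

$$H^\times(\mu\|\mathcal{N}(m, \Sigma)) = H^\times(\mu_G\|\mathcal{N}(m, \Sigma)),$$

where $\mu_G$ denotes the probability measure with Gaussian density of the same
mean and covariance as $\mu$ (that is the density of $\mu_G$ equals $\mathcal{N}(m_\mu, \Sigma_\mu)$).
Consequently

$$H^\times(\mu\|\mathcal{N}(m, \Sigma)) = \tfrac{N}{2}\ln(2\pi) + \tfrac{1}{2}\|m - m_\mu\|_\Sigma^2 + \tfrac{1}{2}\mathrm{tr}(\Sigma^{-1}\Sigma_\mu) + \tfrac{1}{2}\ln\det(\Sigma). \qquad (2.6)$$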
Sketch of the proof. We consider the case when $\mu$ is a continuous measure
with density $f_\mu$. One can easily see that by applying trivial affine
transformations and (2.5) it is sufficient to prove (2.6) in the case when $m = 0$ and
$\Sigma = I$. Then we have

$$H^\times(\mu\|\mathcal{N}(0, I)) = \int f_\mu(x) \cdot \big[\tfrac{N}{2}\ln(2\pi) + \tfrac{1}{2}\ln\det(I) + \tfrac{1}{2}\|x\|^2\big]\,dx$$

$$= \tfrac{N}{2}\ln(2\pi) + \tfrac{1}{2}\int f_\mu(x)\,\|(x - m_\mu) + m_\mu\|^2\,dx$$

$$= \tfrac{N}{2}\ln(2\pi) + \tfrac{1}{2}\int f_\mu(x)\,\big[\|x - m_\mu\|^2 + \|m_\mu\|^2 + 2(x - m_\mu) \circ m_\mu\big]\,dx$$

$$= \tfrac{N}{2}\ln(2\pi) + \tfrac{1}{2}\mathrm{tr}(\Sigma_\mu) + \tfrac{1}{2}\|m_\mu\|^2.$$

□
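Formula (2.6) can be checked numerically in the simplest setting. The sketch below assumes the one-dimensional case N = 1, where the covariance Σ is a single positive number `s`, and a small hypothetical sample treated as a discrete measure with equal weights; it compares the directly averaged code length with the closed form.

```python
import math

def neg_log_gaussian(x, m, s):
    """-ln N(m, s)(x) for a one-dimensional Gaussian with variance s."""
    return 0.5 * math.log(2 * math.pi * s) + (x - m) ** 2 / (2 * s)

def cross_entropy_direct(sample, m, s):
    """H(mu || N(m, s)) computed as the average code length over the sample."""
    return sum(neg_log_gaussian(x, m, s) for x in sample) / len(sample)

def cross_entropy_formula(sample, m, s):
    """Right-hand side of (2.6) specialised to N = 1."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / n  # covariance of mu for N = 1
    return (0.5 * math.log(2 * math.pi)
            + (m - mean) ** 2 / (2 * s)   # (1/2) * squared Mahalanobis distance
            + var / (2 * s)               # (1/2) * tr(Sigma^-1 Sigma_mu)
            + 0.5 * math.log(s))          # (1/2) * ln det(Sigma)

sample = [0.0, 1.0, 3.0, 4.5, 7.0]        # illustrative data
direct = cross_entropy_direct(sample, m=2.0, s=1.5)
formula = cross_entropy_formula(sample, m=2.0, s=1.5)
print(direct, formula)
```

Only the mean and the variance of the sample enter the closed form, which is the practical content of the theorem: the cross-entropy with respect to a Gaussian depends on the data only through its first two moments.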

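By $\mathcal{G}$ we denote the set of all normal densities, while by $\mathcal{G}_\Sigma$ we denote the
set of all normal densities with covariance $\Sigma$. As a trivial consequence of
Theorem 2.1 we obtain the following proposition.

Proposition 2.1. Let $\Sigma$ be a fixed positive symmetric matrix. Then
$\mathrm{MLE}(\mu\|\mathcal{G}_\Sigma) = \{\mathcal{N}(m_\mu, \Sigma)\}$ and
$H^\times(\mu\|\mathcal{G}_\Sigma) = \tfrac{N}{2}\ln(2\pi) + \tfrac{1}{2}\mathrm{tr}(\Sigma^{-1}\Sigma_\mu) + \tfrac{1}{2}\ln\det(\Sigma)$.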

Now we consider cross-entropy with respect to all normal densities. 

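Proposition 2.2. We have $\mathrm{MLE}(\mu\|\mathcal{G}) = \{\mathcal{N}(m_\mu, \Sigma_\mu)\}$ and

$$H^\times(\mu\|\mathcal{G}) = \tfrac{1}{2}\ln\det(\Sigma_\mu) + \tfrac{N}{2}\ln(2\pi e). \qquad (2.7)$$

Proof. Since entropy is minimal when we code a measure by its own density,
we easily obtain that

$$H^\times(\mu\|\mathcal{G}) = H^\times(\mu_G\|\mathcal{G}) = H^\times(\mu_G\|\mathcal{N}(m_\mu, \Sigma_\mu)) = H(\mu_G) = \tfrac{1}{2}\ln\det(\Sigma_\mu) + \tfrac{N}{2}\ln(2\pi e).$$

Consequently the minimum is realized for $\mathcal{N}(m_\mu, \Sigma_\mu)$. □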



Due to their importance and simplicity we also consider Spherical Gaussians
$\mathcal{G}_{(\cdot I)}$, that is those with covariance matrix proportional to $I$:

$$\mathcal{G}_{(\cdot I)} = \bigcup_{s > 0} \mathcal{G}_{sI}.$$
We will need the notation for the mean squared distance from the mean⁷

$$D_\mu := \int \|x - m_\mu\|^2\,d\mu(x) = \mathrm{tr}(\Sigma_\mu),$$

which will play in Spherical Gaussians the analogue of covariance. As is the
case for the covariance, we will use the abbreviation

$$D_U^\mu := D_{\mu_U} = \int \|x - m_U^\mu\|^2\,d\mu_U(x).$$

Observe that if $\Sigma_\mu = sI$ then $D_\mu = Ns$. In the case of one-dimensional
measures $\sqrt{D_\mu}$ is exactly the standard deviation of the measure $\mu$. It
turns out that $D_\mu$ can be naturally interpreted as the square of the "mean
radius" of the measure $\mu$: for the uniform probability measure $\mu$ on the sphere
$S(x, R) \subset \mathbb{R}^N$ we clearly get $\sqrt{D_\mu} = R$. Moreover, as we show in the
following observation, $\sqrt{D_\mu}$ will be close to $R$ even for the uniform probability
distribution on the ball $B(x, R)$.

By $\lambda$ we denote the Lebesgue measure on $\mathbb{R}^N$. Recall that according
to our notation $\lambda_U$ denotes the probability measure defined by
$\lambda_U(A) := \lambda(A \cap U)/\lambda(U)$.

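Observation 2.2. We put $V_n := \lambda(B_n(0, 1)) = \pi^{n/2}/\Gamma(n/2 + 1)$, where
$B_n(0, 1)$ denotes the unit ball in $\mathbb{R}^n$.

Consider the unit ball $B(0, 1) \subset \mathbb{R}^N$. Directly from the definition of
covariance we get $\Sigma_{\lambda_{B(0,1)}} = c_N I$, where

$$c_N = \frac{V_{N-1}}{V_N}\int_{-1}^{1} x^2 (1 - x^2)^{(N-1)/2}\,dx = \frac{1}{N+2}.$$

Consequently,

$$\Sigma_{\lambda_{B(x,R)}} = R^2\,\Sigma_{\lambda_{B(0,1)}} = \frac{R^2}{N+2} I, \qquad D_{\lambda_{B(x,R)}} = \mathrm{tr}(\Sigma_{\lambda_{B(x,R)}}) = \frac{N R^2}{N+2}, \qquad (2.8)$$

and therefore

$$\sqrt{D_{\lambda_{B(x,R)}}} = \sqrt{\tfrac{N}{N+2}}\,R \to R \quad \text{as } N \to \infty.$$

⁷ We will see that it corresponds to the mean within clusters sum of squares.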
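The value 1/(N+2) for the variance of one coordinate of the uniform distribution on the unit ball can be verified numerically. The sketch below is our own illustration: it assumes the ball is sliced along one coordinate, so that the slice at height t is an (N-1)-dimensional ball of radius sqrt(1 - t²), and integrates with a plain midpoint rule; the function names and the number of steps are arbitrary choices.

```python
import math

def unit_ball_volume(n):
    """Lebesgue measure of the unit ball in R^n."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def coordinate_variance(n, steps=20000):
    """Midpoint-rule estimate of (V_{n-1}/V_n) * integral of t^2 (1-t^2)^((n-1)/2) over [-1, 1]."""
    h = 2.0 / steps
    total = 0.0
    for j in range(steps):
        t = -1.0 + (j + 0.5) * h
        total += t * t * (1.0 - t * t) ** ((n - 1) / 2)
    return unit_ball_volume(n - 1) / unit_ball_volume(n) * total * h

def mean_radius_ratio(n):
    """sqrt(D)/R for the uniform measure on a ball of radius R in R^n."""
    return math.sqrt(n / (n + 2))

for n in (1, 2, 3, 10, 100):
    print(n, coordinate_variance(n), 1 / (n + 2), mean_radius_ratio(n))
```

The last column shows how quickly the "mean radius" approaches the true radius as the dimension grows.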
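Proposition 2.3. We have $\mathrm{MLE}(\mu\|\mathcal{G}_{(\cdot I)}) = \{\mathcal{N}(m_\mu, \tfrac{D_\mu}{N} I)\}$ and

$$H^\times(\mu\|\mathcal{G}_{(\cdot I)}) = \tfrac{N}{2}\ln(D_\mu) + \tfrac{N}{2}\ln(2\pi e/N). \qquad (2.9)$$

Proof. Clearly by Proposition 2.1

$$H^\times(\mu\|\mathcal{G}_{(\cdot I)}) = \inf_{s > 0} H^\times(\mu\|\mathcal{G}_{sI}) = \inf_{s > 0}\Big(\frac{1}{2s}D_\mu + \tfrac{N}{2}\ln s + \tfrac{N}{2}\ln(2\pi)\Big).$$

Now by easy calculations we obtain that the above function attains its minimum
for $s = D_\mu/N$, and that the minimum equals the RHS of (2.9). □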
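The "easy calculations" in the proof can be illustrated numerically. The sketch below assumes hypothetical values D = 7.5 and N = 3 and scans a grid of scales around the claimed minimizer; it is only a sanity check of the minimization over s, not a proof.

```python
import math

def spherical_cost(s, D, N):
    """Cross-entropy with respect to Gaussians of covariance s*I (Proposition 2.1 with Sigma = s*I)."""
    return D / (2 * s) + 0.5 * N * math.log(s) + 0.5 * N * math.log(2 * math.pi)

def spherical_optimum(D, N):
    """Right-hand side of (2.9)."""
    return 0.5 * N * math.log(D) + 0.5 * N * math.log(2 * math.pi * math.e / N)

D, N = 7.5, 3                      # illustrative values
s_star = D / N                     # claimed minimizer
grid = [s_star * (0.5 + 0.01 * j) for j in range(101)]   # scales from 0.5*s_star to 1.5*s_star
best_on_grid = min(grid, key=lambda s: spherical_cost(s, D, N))
print(best_on_grid, s_star)
print(spherical_cost(s_star, D, N), spherical_optimum(D, N))
```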



At the end we consider the cross-entropy with respect to $\mathcal{G}_{sI}$ (spherical
Gaussians with fixed scale). As a direct consequence of Proposition 2.1 we
get:

Proposition 2.4. Let $s > 0$ be given. Then $\mathrm{MLE}(\mu\|\mathcal{G}_{sI}) = \{\mathcal{N}(m_\mu, sI)\}$ and

$$H^\times(\mu\|\mathcal{G}_{sI}) = \frac{1}{2s}D_\mu + \tfrac{N}{2}\ln s + \tfrac{N}{2}\ln(2\pi).$$



3. Many coding subdensities 
3.1. Basic idea 

In the previous section we considered the coding with $d$ symbols of the
$\mu$-randomly chosen point $x \in \mathbb{R}^N$ by the code optimized for the subdensity
$f$. Since it is often better to "pack/compress" parts of data with various
algorithms, we follow this approach and assume that we are given a sequence
of $k$ subdensities⁸ $(f_i)_{i=1}^k$, which we interpret as coding algorithms.

Suppose that we want to code $x$ by the $j$-th algorithm from the sequence
$(f_i)_{i=1}^k$. By (2.1) the length of the code of $x$ corresponds to $-\log_d f_j(x)$. However,
this code itself is clearly insufficient to decode $x$ if we do not know which
coding algorithm was used. Therefore to uniquely code $x$ we have to add to
it the code of $j$. Thus if $l_j$ denotes the length of the code of $j$, the "final" length
of the code of the point $x$ is the sum of $l_j$ and the length of the code of the
point $x$:

code-length of $x$ = $l_j - \log_d f_j(x)$.

⁸ In general we accept also $k = \infty$.

Since the coding of the algorithms has to be acceptable, the sequence $(l_i)_{i=1}^k$
has to satisfy Kraft's inequality and therefore if we put $p_i = d^{-l_i}$, we
can consider only those $p_i \ge 0$ that $\sum_{i=1}^k p_i \le 1$. Consequently without loss of
generality (by possibly shortening the expected code-length), we may restrict
to the case when $\sum_{i=1}^k p_i = 1$.

Now suppose that points from $U_i \subset \mathbb{R}^N$ we code by the subdensity $f_i$.
Observe that although the $U_i$ have to be pairwise disjoint, they do not have to
cover the whole space - we can clearly omit a set of $\mu$-measure zero.
To formalize this, the notion of $\mu$-partition⁹ for a given continuous or discrete
measure $\mu$ is convenient: we say that a pairwise disjoint sequence $(U_i)_{i=1}^k$ of
Lebesgue measurable subsets of $\mathbb{R}^N$ is a $\mu$-partition if

$$\mu\Big(\mathbb{R}^N \setminus \bigcup_{i=1}^k U_i\Big) = 0.$$

To sum up: we have the "coding" subdensities $(f_i)_{i=1}^k$ and $p \in P_k$, where

$$P_k := \Big\{(p_1, \ldots, p_k) \in [0, 1]^k : \sum_{i=1}^k p_i = 1\Big\}.$$

As $U_i$ we take the set of points of $\mathbb{R}^N$ we code by density $f_i$. Then for a
$\mu$-partition $(U_i)_{i=1}^k$ we obtain the code-length function

$$x \to -\log_d p_i - \log_d f_i(x) \quad \text{for } x \in U_i,$$

which is exactly the code-length of the subdensity¹⁰

$$p_1 f_1|_{U_1} \cup \ldots \cup p_k f_k|_{U_k}. \qquad (3.1)$$

In general we search for those $p$ and $\mu$-partition for which the expected
code-length given by the cross-entropy $H^\times\big(\mu \,\big\|\, \bigcup_{i=1}^k p_i f_i|_{U_i}\big)$ will be minimal.

⁹ We introduce the $\mu$-partition since, in dealing in practice with clustering of discrete data,
it is natural to partition just the dataset and not the whole space.

¹⁰ Observe that this density is defined for $\mu$-almost all $x \in \mathbb{R}^N$.
Definition 3.1. Let $(\mathcal{F}_i)_{i=1}^k$ be a sequence of subdensity families in $\mathbb{R}^N$, and
let a $\mu$-partition $(U_i)_{i=1}^k$ be given. Then we define

$$\biguplus_{i=1}^k (\mathcal{F}_i|U_i) := \Big\{\bigcup_{i=1}^k p_i f_i|_{U_i} : (p_i)_{i=1}^k \in P_k,\ (f_i)_{i=1}^k \in (\mathcal{F}_i)_{i=1}^k\Big\}.$$

Observe that $\biguplus_{i=1}^k (\mathcal{F}_i|U_i)$ denotes those compression algorithms which can
be built by using an arbitrary compression subdensity from $\mathcal{F}_i$ on the set $U_i$.

3.2. Lloyd's algorithm 

The basic aim of our article is to find a $\mu$-partition $(U_i)_{i=1}^k$ for which

$$H^\times\Big(\mu \,\Big\|\, \biguplus_{i=1}^k (\mathcal{F}_i|U_i)\Big)$$

is minimal. In general it is an NP-hard problem even for k-means [27], which is
the simplest limiting case of Spherical CEC (see Observation 4.7). However,
in practice we can often hope to find a sufficiently good solution by applying
either Lloyd's or Hartigan's method.

The basis of Lloyd's approach is given by the following two results which
show that

• given $p \in P_k$ and $(f_i)_{i=1}^k \in (\mathcal{F}_i)_{i=1}^k$, we can find a partition $(U_i)_{i=1}^k$ which
minimizes the cross-entropy $H^\times\big(\mu \,\big\|\, \bigcup_{i=1}^k p_i f_i|_{U_i}\big)$;

• for a partition $(U_i)_{i=1}^k$, we can find $p \in P_k$ and $(f_i)_{i=1}^k \in (\mathcal{F}_i)_{i=1}^k$ which
minimize $H^\times\big(\mu \,\big\|\, \bigcup_{i=1}^k p_i f_i|_{U_i}\big)$.

We first show how to minimize the value of cross-entropy being given a
$\mu$-partition $(U_i)_{i=1}^k$. From now on we interpret $0 \cdot x$ as zero even if $x = \pm\infty$
or $x$ is not properly defined.

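Observation 3.1. Let $(f_i)_{i=1}^k \in (\mathcal{F}_i)_{i=1}^k$, $p \in P_k$ and $(U_i)_{i=1}^k$ be a $\mu$-partition. Then

$$H^\times\Big(\mu \,\Big\|\, \bigcup_{i=1}^k p_i f_i|_{U_i}\Big) = \sum_{i=1}^k \mu(U_i) \cdot \big(-\ln p_i + H^\times(\mu_{U_i}\|f_i)\big). \qquad (3.2)$$

Proof. We have

$$H^\times\Big(\mu \,\Big\|\, \bigcup_{i=1}^k p_i f_i|_{U_i}\Big) = \sum_{i=1}^k \int_{U_i} \big[-\ln p_i - \ln f_i(x)\big]\,d\mu(x)$$

$$= \sum_{i=1}^k \mu(U_i) \cdot \Big(-\ln p_i - \int \ln(f_i(x))\,d\mu_{U_i}(x)\Big).$$

□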

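Proposition 3.1. Let the sequence of subdensity families $(\mathcal{F}_i)_{i=1}^k$ be given
and let $(U_i)_{i=1}^k$ be a fixed $\mu$-partition. We put $p = (\mu(U_i))_{i=1}^k \in P_k$.
Then

$$H^\times\Big(\mu \,\Big\|\, \biguplus_{i=1}^k (\mathcal{F}_i|U_i)\Big) = \inf_{f_i \in \mathcal{F}_i} H^\times\Big(\mu \,\Big\|\, \bigcup_{i=1}^k p_i f_i|_{U_i}\Big) = \sum_{i=1}^k \mu(U_i) \cdot \big[-\ln(\mu(U_i)) + H^\times(\mu_{U_i}\|\mathcal{F}_i)\big].$$

Proof. We apply the formula (3.2)

$$H^\times\Big(\mu \,\Big\|\, \bigcup_{i=1}^k p_i f_i|_{U_i}\Big) = \sum_{i=1}^k \mu(U_i) \cdot \big(-\ln p_i + H^\times(\mu_{U_i}\|f_i)\big).$$

By the property of classical entropy we know that the function

$$P_k \ni p = (p_i)_{i=1}^k \to \sum_{i=1}^k \mu(U_i) \cdot (-\ln p_i)$$

is minimized for $p = (\mu(U_i))_{i=1}^k$. □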
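The above can be equivalently rewritten with the use of the notation:

$$h_\mu(\mathcal{F}; W) := \begin{cases} \mu(W) \cdot \big(-\ln(\mu(W)) + H^\times(\mu_W\|\mathcal{F})\big) & \text{if } \mu(W) > 0, \\ 0 & \text{otherwise.} \end{cases}$$

Thus $h_\mu(\mathcal{F}; W)$ tells us what is the minimal cost of compression of the part
of our dataset contained in $W$ by subdensities from $\mathcal{F}$. By Proposition 3.1,
if $(U_i)_{i=1}^k$ is a $\mu$-partition then

$$H^\times\Big(\mu \,\Big\|\, \biguplus_{i=1}^k (\mathcal{F}_i|U_i)\Big) = \sum_{i=1}^k h_\mu(\mathcal{F}_i; U_i). \qquad (3.3)$$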
Observe that, in general, if $\mu(U) > 0$ then



Consequently, if we are given a $\mu_U$-partition $(U_i)_{i=1}^k$, then

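Theorem 3.1. Let the sequence of subdensity families $(\mathcal{F}_i)_{i=1}^k$ be given and
let $(U_i)_{i=1}^k$ be a fixed $\mu$-partition.

We put $p = (\mu(U_i))_{i=1}^k \in P_k$. We assume that $\mathrm{MLE}(\mu_{U_i}\|\mathcal{F}_i)$ is nonempty
for every $i = 1, \ldots, k$. Then for arbitrary

$$f_i \in \mathrm{MLE}(\mu_{U_i}\|\mathcal{F}_i) \quad \text{for } i = 1, \ldots, k, \qquad (3.4)$$

we get

$$H^\times\Big(\mu \,\Big\|\, \biguplus_{i=1}^k (\mathcal{F}_i|U_i)\Big) = H^\times\Big(\mu \,\Big\|\, \bigcup_{i=1}^k p_i f_i|_{U_i}\Big).$$

Proof. Directly from the definition of MLE we obtain that

$$H^\times(\mu_{U_i}\|\tilde{f}_i) \ge H^\times(\mu_{U_i}\|\mathcal{F}_i) = H^\times(\mu_{U_i}\|f_i)$$

for $\tilde{f}_i \in \mathcal{F}_i$. □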



The following theorem is a dual version of Theorem 3.1 - for fixed $p \in P_k$
and $f_i \in \mathcal{F}_i$ we seek an optimal $\mu$-partition which minimizes the cross-entropy.

By the support of the measure $\mu$ we denote the support of its density if $\mu$ is
continuous and the set of support points if it is discrete.

Theorem 3.2. Let the sequence of subdensity families $(\mathcal{F}_i)_{i=1}^k$ be given and
let $f_i \in \mathcal{F}_i$ and $p \in P_k$ be such that $\mathrm{supp}(\mu) \subset \bigcup_{i=1}^k \mathrm{supp}(f_i)$. We define
$l : \mathrm{supp}(\mu) \to (-\infty, \infty]$ by

$$l(x) := \min_{i \in \{1, \ldots, k\}} \big[-\ln p_i - \ln f_i(x)\big].$$

We construct a sequence $(U_i)_{i=1}^k$ of measurable subsets of $\mathbb{R}^N$ recursively
by the following procedure:

• $U_1 = \{x \in \mathrm{supp}(\mu) : -\ln p_1 - \ln f_1(x) = l(x)\}$;

• $U_{j+1} = \{x \in \mathrm{supp}(\mu) \setminus (U_1 \cup \ldots \cup U_j) : -\ln p_{j+1} - \ln f_{j+1}(x) = l(x)\}$.

Then $(U_i)_{i=1}^k$ is a $\mu$-partition and

$$H^\times\Big(\mu \,\Big\|\, \bigcup_{i=1}^k p_i f_i|_{U_i}\Big) = \inf\Big\{H^\times\Big(\mu \,\Big\|\, \bigcup_{i=1}^k p_i f_i|_{V_i}\Big) : \mu\text{-partition } (V_i)_{i=1}^k\Big\}.$$

Proof. Since $\mathrm{supp}(\mu) \subset \bigcup_{i=1}^k \mathrm{supp}(f_i)$, we obtain that $(U_i)_{i=1}^k$ is a $\mu$-partition.
Moreover, directly by the choice of $(U_i)_{i=1}^k$ we obtain that

$$l(x) = -\ln\Big(\bigcup_{i=1}^k p_i f_i|_{U_i}\Big)(x) \quad \text{for } x \in \mathrm{supp}(\mu),$$

and consequently for an arbitrary $\mu$-partition $(V_i)_{i=1}^k$ we get

$$H^\times\Big(\mu \,\Big\|\, \bigcup_{i=1}^k p_i f_i|_{V_i}\Big) = \sum_{i=1}^k \int_{V_i} \big[-\ln(p_i) - \ln(f_i(x))\big]\,d\mu(x)$$

$$\ge \int l(x)\,d\mu(x) = \sum_{i=1}^k \int_{U_i} \big[-\ln(p_i) - \ln(f_i(x))\big]\,d\mu(x).$$

□

As we have mentioned before, Lloyd's approach is based on the alternate use
of the steps from Theorems 3.1 and 3.2. In practice we usually start by choosing
initial densities and setting the probabilities $p_i$ equal: $p = (1/k, \ldots, 1/k)$ (since
the convergence is to a local minimum we commonly start from various initial
conditions several times).

Observe that directly by Theorems 3.1 and 3.2 we obtain that the sequence
$n \to h_n$ of cross-entropy values obtained in consecutive steps is decreasing.
One hopes¹¹ that its limit equals (or at least is reasonably close to) the global
infimum of $H^\times\big(\mu \,\big\|\, \biguplus_{i=1}^k (\mathcal{F}_i|U_i)\big)$.
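The alternation of the two steps can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes one-dimensional data, the family of all Gaussians on every cluster, and a hypothetical rule (`min_size`) that drops clusters with too few points so that the fitted variance stays positive. It also starts from an initial labelling rather than from initial densities, which is an equally admissible entry point into the loop.

```python
import math

def fit_gaussian(points):
    """Maximum likelihood mean and variance of a 1D sample."""
    n = len(points)
    mean = sum(points) / n
    var = sum((x - mean) ** 2 for x in points) / n
    return mean, var

def point_cost(x, p, mean, var):
    """Code length -ln p_i - ln f_i(x) for f_i = N(mean, var), in nats."""
    return (-math.log(p) + 0.5 * math.log(2 * math.pi * var)
            + (x - mean) ** 2 / (2 * var))

def lloyd_cec(data, labels, min_size=3, max_iter=100):
    labels = list(labels)
    for _ in range(max_iter):
        # Step in the spirit of Theorem 3.1: for the current partition take
        # p_i = mu(U_i) and the maximum likelihood Gaussian on each cluster.
        groups = {}
        for x, lab in zip(data, labels):
            groups.setdefault(lab, []).append(x)
        models = {}
        for lab, pts in groups.items():
            if len(pts) < min_size:
                continue  # assumption: too small clusters are dropped
            mean, var = fit_gaussian(pts)
            if var > 0:
                models[lab] = (len(pts) / len(data), mean, var)
        # Step in the spirit of Theorem 3.2: each point goes to the cheapest code.
        new_labels = [min(models, key=lambda lab: point_cost(x, *models[lab]))
                      for x in data]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

group_a = [-1.0, -0.6, -0.2, 0.1, 0.4, 0.7, 1.0, -0.4, 0.2, 0.5]
group_b = [x + 20.0 for x in group_a]
data = group_a + group_b
start = [0] * 10 + [1] * 9 + [2]     # a third, one-point cluster to be absorbed
result = lloyd_cec(data, start)
print(sorted(set(result)))
```

In this toy run the one-point cluster is dropped in the first pass and its point is absorbed by the neighbouring group, after which the partition is a fixed point of both steps.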

To show a simple example of cross-entropy minimization we first need
some notation. We are going to discuss the Lloyd's cross-entropy minimization
of discrete data with respect to $\mathcal{G}_{\Sigma_1}, \ldots, \mathcal{G}_{\Sigma_k}$. As a direct consequence
of (3.3) and Proposition 2.1 we obtain the formula for the cross-entropy of
$\mu$ with respect to a family of Gaussians with covariances $(\Sigma_i)_{i=1}^k$.

¹¹ To enhance that chance we usually start many times from various initial clusterings.

Figure 3: Effect of clustering by Lloyd's algorithm on T-like shape. Panels: (a) T-like set;
(b) Partition into two groups.



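Observation 3.2. Let $(\Sigma_i)_{i=1}^k$ be fixed positive symmetric matrices and let
$(U_i)_{i=1}^k$ be a given $\mu$-partition. Then

$$H^\times\Big(\mu \,\Big\|\, \biguplus_{i=1}^k (\mathcal{G}_{\Sigma_i}|U_i)\Big) = \tfrac{N}{2}\ln(2\pi) + \sum_{i=1}^k \mu(U_i) \cdot \big[-\ln(\mu(U_i)) + \tfrac{1}{2}\mathrm{tr}(\Sigma_i^{-1}\Sigma_{U_i}^\mu) + \tfrac{1}{2}\ln\det(\Sigma_i)\big].$$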

Example 3.1. We show Lloyd's approach to cross-entropy minimization of
the set $Y$ shown in Figure 3(a). As is usual, we first associate with the
data-set $Y$ the probability measure $\mu_Y$ defined by the formula

$$\mu_Y := \frac{1}{\mathrm{card}\,Y}\sum_{y \in Y} \delta_y,$$

where $\delta_y$ denotes the Dirac delta at the point $y$.

Next we search for the $\mu_Y$-partition $Y = Y_1 \cup Y_2$ which minimizes

$$H^\times\Big(\mu_Y \,\Big\|\, \biguplus_{i=1}^2 (\mathcal{G}_{\Sigma_i}|Y_i)\Big),$$

where $\Sigma_1 = [300, 0; 0, 1]$, $\Sigma_2 = [1, 0; 0, 300]$. The result is given in Figure
3(b), where the dark gray points, which belong to $Y_1$, are "coded" by a density
from $\mathcal{G}_{\Sigma_1}$ and the light gray points, which belong to $Y_2$, are "coded" by a density
from $\mathcal{G}_{\Sigma_2}$.

3.3. Hartigan algorithm 

Due to its nature, to use Hartigan we have to divide the data-set (or
more precisely the support of the measure $\mu$) into "basic parts/blocks" from
which we construct our clustering/grouping. Suppose that we have a fixed
$\mu$-partition¹² $V = (V_i)_{i=1}^n$. The aim of Hartigan is to find such a $\mu$-partition
built from elements of $V$ which has minimal cross-entropy.

Consider $k$ coding subdensity families $(\mathcal{F}_i)_{i=1}^k$. To explain the Hartigan
approach more precisely we need the notion of a group membership function
$\mathrm{gr} : \{1, \ldots, n\} \to \{0, \ldots, k\}$ which describes the membership of the $i$-th
element of the partition, where the value $0$ is a special symbol which denotes that $V_i$ is
as yet unassigned. In other words: if $\mathrm{gr}(i) = l > 0$, then $V_i$ is a part of the
$l$-th group, and if $\mathrm{gr}(i) = 0$ then $V_i$ is unassigned.

We want to find such $\mathrm{gr} : \{1, \ldots, n\} \to \{1, \ldots, k\}$ (thus all elements of $V$
are assigned) that

$$\sum_{l=1}^k h_\mu\Big(\mathcal{F}_l; \bigcup_{i:\,\mathrm{gr}(i) = l} V_i\Big)$$

is minimal. The basic idea of Hartigan is relatively simple - we repeatedly go
over all elements of the partition $V = (V_i)_{i=1}^n$ and apply the following steps:

• if the chosen set $V_i$ is unassigned, assign it to the first nonempty group;

• reassign $V_i$ to that group for which the decrease in cross-entropy is
maximal;

• check if no group needs to be removed/unassigned; if this is the case,
unassign all its elements;

until no group membership has been changed.

To practically apply Hartigan's algorithm we still have to decide about
the way we choose the initial group membership. In most examples in this
paper we initialized the cluster membership function randomly. However, one
can naturally speed up the clustering by using some more intelligent cluster
initialization of the kind commonly encountered in the modifications of
k-means (one can for example easily use the k-means++ approach [28]).

To implement the Hartigan approach for discrete measures we still have to
add a condition for when we unassign a given group. For example, in the case of
Gaussian clustering in $\mathbb{R}^N$, to avoid overfitting we cannot consider clusters
which contain fewer than $N + 1$ points. In practice, while applying the Hartigan
approach on discrete data we usually removed clusters which contained less
than three percent of the whole data-set.

¹² By default we think of it as a partition into sets with small diameter.

Observe that in the crucial step in the Hartigan approach we compare the
cross-entropy after and before the switch, while the switch removes a given
set from one cluster and adds it to the other. Since

$$h_\mu(\mathcal{F}; W) = \mu(W) \cdot \big(-\ln(\mu(W)) + H^\times(\mu_W\|\mathcal{F})\big),$$

the basic steps in the Hartigan approach reduce to the computation of $H^\times(\mu_W\|\mathcal{F})$
for $W = U \cup V$ and $W = U \setminus V$. This implies that to apply efficiently the
Hartigan approach in clustering it is of basic importance to compute

• $H^\times(\mu_{U \cup V}\|\mathcal{F})$ for disjoint $U, V$;

• $H^\times(\mu_{U \setminus V}\|\mathcal{F})$ for $V \subset U$.

Since in the case of Gaussians to compute the cross-entropy of $\mu_W$ we need
only the covariance $\Sigma_W^\mu$, our problem reduces to the computation of $\Sigma_{U \cup V}^\mu$ and $\Sigma_{U \setminus V}^\mu$.
One can easily verify that for a convex combination of two measures we have:

Theorem 3.3. Let $U, V$ be Lebesgue measurable sets with finite and nonzero
$\mu$-measures.

a) Assume additionally that $U \cap V = \emptyset$. Then

$$\Sigma_{U \cup V}^\mu = p_U \Sigma_U^\mu + p_V \Sigma_V^\mu + p_U p_V (m_U^\mu - m_V^\mu)(m_U^\mu - m_V^\mu)^T,$$

where $p_U := \frac{\mu(U)}{\mu(U) + \mu(V)}$, $p_V := \frac{\mu(V)}{\mu(U) + \mu(V)}$.

b) Assume that $V \subset U$ is such that $\mu(V) < \mu(U)$. Then

$$\Sigma_{U \setminus V}^\mu = q_U \Sigma_U^\mu - q_V \Sigma_V^\mu - q_U q_V (m_U^\mu - m_V^\mu)(m_U^\mu - m_V^\mu)^T,$$

where $q_U := \frac{\mu(U)}{\mu(U) - \mu(V)}$, $q_V := \frac{\mu(V)}{\mu(U) - \mu(V)}$.
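The two update rules are what make a single Hartigan switch cheap: the statistics of the enlarged and the reduced cluster follow from those already stored. Below is a minimal sketch for one-dimensional data, where the covariance is a scalar variance and the measure is the counting measure; the accompanying mean updates are the elementary weighted-mean identities and are our addition, not part of the statement above.

```python
import math

def mean_var(points):
    """Mean and (population) variance of a 1D sample."""
    n = len(points)
    m = sum(points) / n
    return m, sum((x - m) ** 2 for x in points) / n

def merge(n_u, m_u, v_u, n_v, m_v, v_v):
    """Mean and variance of the union of disjoint U and V (part a)."""
    p_u, p_v = n_u / (n_u + n_v), n_v / (n_u + n_v)
    mean = p_u * m_u + p_v * m_v
    var = p_u * v_u + p_v * v_v + p_u * p_v * (m_u - m_v) ** 2
    return mean, var

def remove(n_u, m_u, v_u, n_v, m_v, v_v):
    """Mean and variance of U with a subset V taken out (part b)."""
    q_u, q_v = n_u / (n_u - n_v), n_v / (n_u - n_v)
    mean = q_u * m_u - q_v * m_v
    var = q_u * v_u - q_v * v_v - q_u * q_v * (m_u - m_v) ** 2
    return mean, var

U = [1.0, 2.0, 4.0, 7.0]          # illustrative clusters
V = [10.0, 12.0, 15.0]
m_u, v_u = mean_var(U)
m_v, v_v = mean_var(V)
m_w, v_w = mean_var(U + V)

merged = merge(len(U), m_u, v_u, len(V), m_v, v_v)
reduced = remove(len(U) + len(V), m_w, v_w, len(V), m_v, v_v)
print(merged, (m_w, v_w))
print(reduced, (m_u, v_u))
```

Removing V from the union recovers the statistics of U, so a move of one block between two clusters needs only one `remove` and one `merge` instead of a pass over all points.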

4. Clustering with respect to Gaussian families 

4.1. Introduction to clustering

In the following part of our paper we study the applications of our
theory to clustering, where by clustering we understand division of the data
into groups of similar type. Therefore, since in clustering we consider only
one fixed subdensity family $\mathcal{F}$, we will use the notation

$$h_\mu(\mathcal{F}; (U_i)_{i=1}^k) := \sum_{i=1}^k h_\mu(\mathcal{F}; U_i), \qquad (4.1)$$

for the family $(U_i)_{i=1}^k$ of pairwise disjoint Lebesgue measurable sets. We see
that (4.1) gives the total memory cost of disjoint $\mathcal{F}$-clustering of $(U_i)_{i=1}^k$.

The aim of $\mathcal{F}$-clustering is to find a $\mu$-partition $(U_i)_{i=1}^k$ (with possibly
empty elements) which minimizes

$$H^\times\Big(\mu \,\Big\|\, \biguplus_{i=1}^k (\mathcal{F}|U_i)\Big) = \sum_{i=1}^k h_\mu(\mathcal{F}; U_i).$$

Observe that the number of sets $(U_i)$ with nonzero $\mu$-measure gives us the
number of clusters into which we have divided our space.

In many cases we want the clustering to be independent of translations, 
change of scale, isometry, etc. 

Definition 4.1. Suppose that we are given a probability measure $\mu$. We say
that the clustering is $A$-invariant if, instead of clustering $\mu$, we obtain the
same effect by

• introducing $\mu_A := \mu \circ A^{-1}$ (observe that if $\mu$ corresponds to the data $Y$
then $\mu_A$ corresponds to the set $A(Y)$);

• obtaining the clustering $(V_i)_{i=1}^k$ of $\mu_A$;

• taking as the clustering of $\mu$ the sets $U_i = A^{-1}(V_i)$.

This problem is addressed in the following observation, which is a direct
consequence of Corollary 2.5.

Observation 4.1. Let $\mathcal{F}$ be a given subdensity family and $A$ be an affine
invertible map. Then

$$H^\times\Big(\mu \,\Big\|\, \bigcup_{i=1}^k (\mathcal{F}|U_i)\Big) = H^\times\Big(\mu \circ A^{-1} \,\Big\|\, \bigcup_{i=1}^k (\mathcal{F}_A|A(U_i))\Big) + \ln|\det A|.$$

As a consequence we obtain that if $\mathcal{F}$ is $A$-invariant, that is $\mathcal{F} = \mathcal{F}_A$,
then the $\mathcal{F}$ clustering is also $A$-invariant.

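The next important problem in clustering theory is the question of how to
verify cluster validity. Cross-entropy theory gives a simple and reasonable
answer: from the information point of view the clustering

$$U = U_1 \cup \ldots \cup U_k$$

is profitable if we gain on separate compression by division into $(U_i)_{i=1}^k$, that
is when:

$$h_\mu(\mathcal{F}; (U_i)_{i=1}^k) < h_\mu(\mathcal{F}; U).$$

This leads us to the definition of the $\mu$-divergence of the splitting $U = U_1 \cup \ldots \cup U_k$:

$$d_\mu(\mathcal{F}; (U_i)_{i=1}^k) := h_\mu(\mathcal{F}; U) - h_\mu(\mathcal{F}; (U_i)_{i=1}^k).$$

Trivially, if $d_\mu(\mathcal{F}; (U_i)_{i=1}^k) > 0$ then we gain in using the clusters $(U_i)_{i=1}^k$.
Moreover, if $(U_i)_{i=1}^k$ is a $\mu$-partition then

$$d_\mu(\mathcal{F}; (U_i)_{i=1}^k) = H^\times(\mu \| \mathcal{F}) - H^\times\Big(\mu \,\Big\|\, \bigcup_{i=1}^k (\mathcal{F}|U_i)\Big).$$

Observe that the above formula is somewhat reminiscent of the classical
Kullback-Leibler divergence.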

4.2. Gaussian Clustering

There are two most important clustering families one usually considers, 
namely subfamilies of gaussian densities and of uniform densities. In general 
gaussian densities are easier to use, faster in implementations, and more 
often appear in "real-life" data. However, in some cases the use of uniform 
densities is preferable as it gives strict estimates of where the
data points belong.

For example, in computer games we often use bounding boxes/ellipsoids to avoid
unnecessary verification of non-existing collisions.

Remark 4.1. Clearly, among uniform families, due to their affine invariance
and "good" covering properties, the most important are uniform densities on
ellipsoids. Let us mention that the clustering of a dataset $Y$ by ellipsoids
which aims to find the partition $Y = Y_1 \cup \ldots \cup Y_k$ that
minimizes

$$\sum_{i=1}^k w_i \cdot \ln\det(\Sigma_{Y_i}),$$

where $w_i$ are weights, is close to CEC based on uniform densities on ellipsoids,
which, as one can easily check, reduces to the minimization of:

$$\sum_{i=1}^k p(Y_i) \cdot \ln\det(\Sigma_{Y_i}) + 2 \sum_{i=1}^k \mathrm{sh}(p(Y_i)),$$

where $p(Y_i) := \mathrm{card}(Y_i)/\mathrm{card}(Y)$.

From now on we fix our attention on Gaussian clustering (we use this
name instead of $\mathcal{G}$-clustering). By Observation 4.1 we obtain that Gaussian
clustering is invariant with respect to affine transformations.

By joining Proposition 2.2 with (4.1) we obtain the basic formula for the
Gaussian cross-entropy.

Observation 4.2. Let $(U_i)_{i=1}^k$ be a sequence of pairwise disjoint measurable
sets. Then

$$h_\mu(\mathcal{G}; (U_i)_{i=1}^k) = \sum_{i=1}^k \mu(U_i) \cdot \Big[ \tfrac{N}{2}\ln(2\pi e) - \ln(\mu(U_i)) + \tfrac{1}{2}\ln\det(\Sigma^\mu_{U_i}) \Big]. \qquad (4.2)$$
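For a finite data set, formula (4.2) can be evaluated directly. The sketch below assumes that $\mu$ is the uniform measure on the data points, that clusters are given by a list of labels, and that every cluster has a nonsingular (biased) covariance; `log_det_spd` and `gaussian_cec_cost` are hypothetical helper names introduced only for this illustration.

```python
import math
from typing import Dict, Hashable, List, Sequence


def log_det_spd(a: Sequence[Sequence[float]]) -> float:
    """ln det of a symmetric positive definite matrix via Cholesky factorization."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("covariance matrix is not positive definite")
                low[i][i] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(n))


def gaussian_cec_cost(points: Sequence[Sequence[float]],
                      labels: Sequence[Hashable]) -> float:
    """Gaussian cross-entropy cost (4.2) of a labelled finite data set."""
    total, dim = len(points), len(points[0])
    groups: Dict[Hashable, List[Sequence[float]]] = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    cost = 0.0
    for members in groups.values():
        n = len(members)
        weight = n / total  # mu(U_i)
        mean = [sum(p[j] for p in members) / n for j in range(dim)]
        cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in members) / n
                for b in range(dim)] for a in range(dim)]
        cost += weight * (0.5 * dim * math.log(2 * math.pi * math.e)
                          - math.log(weight) + 0.5 * log_det_spd(cov))
    return cost
```

Comparing the value for a one-cluster labelling with the value for a candidate split gives the sign of the divergence $d_\mu$ introduced above.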

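In the case of Gaussian clustering, due to the large number of degrees of freedom, we
were not able to obtain a simple general formula for the divergence
of two clusters. However, we can easily handle the case of two groups with
equal covariances.

Theorem 4.1. Let us consider disjoint sets $U_1, U_2 \subset \mathbb{R}^N$ with identical
covariance matrices $\Sigma^\mu_{U_1} = \Sigma^\mu_{U_2} = \Sigma$. Then

$$d_\mu(\mathcal{G}; (U_1, U_2))/(\mu(U_1) + \mu(U_2)) = \tfrac{1}{2}\ln\big(1 + p_1 p_2 \|m^\mu_{U_1} - m^\mu_{U_2}\|_\Sigma^2\big) - \mathrm{sh}(p_1) - \mathrm{sh}(p_2),$$

where $p_i = \mu(U_i)/(\mu(U_1) + \mu(U_2))$.
Consequently $d_\mu(\mathcal{G}; (U_1, U_2)) > 0$ iff

$$\|m^\mu_{U_1} - m^\mu_{U_2}\|_\Sigma^2 > p_1^{-2p_1-1} p_2^{-2p_2-1} - p_1^{-1} p_2^{-1}. \qquad (4.3)$$

Proof. By (4.2)

$$d_\mu(\mathcal{G}; (U_1, U_2))/(\mu(U_1) + \mu(U_2)) = \tfrac{1}{2}\big[\ln\det(\Sigma^\mu_{U_1 \cup U_2}) - \ln\det(\Sigma)\big] - \mathrm{sh}(p_1) - \mathrm{sh}(p_2).$$

By applying Theorem 3.3 the value of $\Sigma^\mu_{U_1 \cup U_2}$ simplifies to $\Sigma + p_1 p_2 m m^T$,
where $m = m^\mu_{U_1} - m^\mu_{U_2}$, and therefore we get

$$\begin{aligned}
d_\mu(\mathcal{G}; (U_1, U_2))/(\mu(U_1) + \mu(U_2))
&= \tfrac{1}{2}\ln\det\big(I + p_1 p_2 \Sigma^{-1/2} m m^T \Sigma^{-1/2}\big) - \mathrm{sh}(p_1) - \mathrm{sh}(p_2) \\
&= \tfrac{1}{2}\ln\det\big(I + p_1 p_2 (\Sigma^{-1/2} m)(\Sigma^{-1/2} m)^T\big) - \mathrm{sh}(p_1) - \mathrm{sh}(p_2).
\end{aligned}$$

Since $\det(I + \alpha v v^T) = 1 + \alpha\|v\|^2$ (to see this it suffices to consider the matrix
of the operator $I + \alpha v v^T$ in an orthonormal basis whose first element is
$v/\|v\|$), we arrive at

$$d_\mu(\mathcal{G}; (U_1, U_2))/(\mu(U_1) + \mu(U_2)) = \tfrac{1}{2}\ln\big(1 + p_1 p_2 \|m\|_\Sigma^2\big) - \mathrm{sh}(p_1) - \mathrm{sh}(p_2).$$

Consequently $d_\mu(\mathcal{G}; (U_1, U_2)) > 0$ iff

$$\ln\big(1 + p_1 p_2 \|m\|_\Sigma^2\big) > 2\,\mathrm{sh}(p_1) + 2\,\mathrm{sh}(p_2),$$

which is equivalent to $1 + p_1 p_2 \|m\|_\Sigma^2 > p_1^{-2p_1} p_2^{-2p_2}$. □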
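The criterion of Theorem 4.1 is easy to evaluate numerically. The following sketch assumes $\mathrm{sh}(x) = -x\ln x$ (the convention under which $2\,\mathrm{sh}(p_1) + 2\,\mathrm{sh}(p_2) = \ln(p_1^{-2p_1} p_2^{-2p_2})$, as used in the proof above) and takes the squared Mahalanobis distance between the means as an input; the function names are illustrative only.

```python
import math


def sh(p: float) -> float:
    """Shannon function sh(p) = -p ln p."""
    return -p * math.log(p)


def equal_cov_divergence(p1: float, maha_sq: float) -> float:
    """Normalized divergence of Theorem 4.1 for relative weights p1, 1 - p1.

    maha_sq is the squared Mahalanobis distance between the two means.
    """
    p2 = 1.0 - p1
    return 0.5 * math.log(1.0 + p1 * p2 * maha_sq) - sh(p1) - sh(p2)


def split_threshold(p1: float) -> float:
    """Right-hand side of (4.3): splitting pays off iff maha_sq exceeds it."""
    p2 = 1.0 - p1
    return p1 ** (-2 * p1 - 1) * p2 ** (-2 * p2 - 1) - 1.0 / (p1 * p2)


# Two groups of equal weight: the threshold is 16 - 4 = 12, i.e. the means
# must be more than sqrt(12) (about 3.46) Mahalanobis units apart.
print(split_threshold(0.5))
```

The divergence vanishes exactly at the threshold, which gives a quick consistency check between the two formulas of the theorem.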



Remark 4.2. As a consequence of (4.3) we obtain that if the means of $U_1$
and $U_2$ are sufficiently close in the Mahalanobis $\|\cdot\|_\Sigma$ distance, then it is
profitable to glue those sets together into one cluster.

Observe also that the constant on the RHS of (4.3) is independent of the
dimension. We mention it because an analogue does not hold for Spherical
clustering, see Observation 4.4.



Example 4.1. Consider the probability measure $\mu_s$ on $\mathbb{R}$ given as the convex
combination of two gaussians with means at $s$ and $-s$, with density

$$f_s := \tfrac{1}{2}\mathcal{N}(s, 1) + \tfrac{1}{2}\mathcal{N}(-s, 1),$$

where $s \ge 0$. Observe that with $s \to \infty$ the initial density $\mathcal{N}(0, 1)$ separates
into two almost independent gaussians.

To check for which $s$ the Gaussian divergence will see this behavior, we
fix the partition $(-\infty, 0), (0, \infty)$. One can easily verify that

$$d_{\mu_s}(\mathcal{G}; ((-\infty, 0), (0, \infty))) = -\ln(2) + \tfrac{1}{2}\ln(1 + s^2) - \tfrac{1}{2}\ln\Big[1 - \tfrac{2e^{-s^2}}{\pi} + s^2 - \tfrac{2\sqrt{2}\,s}{\sqrt{\pi}}\, e^{-s^2/2}\,\mathrm{Erf}\big(\tfrac{s}{\sqrt{2}}\big) - s^2\,\mathrm{Erf}\big(\tfrac{s}{\sqrt{2}}\big)^2\Big].$$

(a) Plot of Gaussian divergence. (b) Densities.

Figure 4: Convex combination of gaussian densities. Black thick density is the bordering
density between one and two clusters.

Consequently, see Figure 4(a), there exists $s_0 \approx 1.518$ such that the clustering
of $\mathbb{R}$ into two clusters $((-\infty, 0), (0, \infty))$ is profitable iff $s > s_0$. In Figure
4(b) we show the densities $f_s$ for $s = 0$ (thin line); $s = 1$ (dashed line); $s = s_0$
(thick line) and $s = 2$ (points).

This theoretical result, which puts the border between one and two clusters
at $s_0$, seems consistent with our geometrical intuition of clustering of $\mu_s$.
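The value of $s_0$ can be recovered numerically. The sketch below uses only quantities that follow from the setting of Example 4.1: the whole mixture has variance $1 + s^2$, each half-line carries mass $1/2$, and the mean of $f_s$ restricted to $(0, \infty)$ equals $\mathbb{E}|Z + s|$ for a standard normal $Z$. The bisection bracket $[0, 3]$ is an assumption chosen because the divergence is negative at $0$ and positive at $3$.

```python
import math


def half_line_mean(s: float) -> float:
    """E|Z + s| for Z ~ N(0, 1): mean of f_s conditioned on (0, inf)."""
    return (math.sqrt(2.0 / math.pi) * math.exp(-s * s / 2.0)
            + s * math.erf(s / math.sqrt(2.0)))


def divergence(s: float) -> float:
    """Gaussian divergence of the split of mu_s into (-inf, 0) and (0, inf)."""
    total_var = 1.0 + s * s                      # variance of the whole mixture
    half_var = total_var - half_line_mean(s) ** 2  # variance on one half-line
    return -math.log(2.0) + 0.5 * math.log(total_var) - 0.5 * math.log(half_var)


def find_s0(lo: float = 0.0, hi: float = 3.0, tol: float = 1e-10) -> float:
    """Bisection for the zero of the divergence; needs divergence(lo) < 0 < divergence(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if divergence(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


print(round(find_s0(), 3))
```

Expanding `half_line_mean(s) ** 2` reproduces the three subtracted terms inside the last logarithm of the closed-form expression in the example.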

4.3. Spherical Clustering

In this section we consider spherical clustering, which can be seen as a
simpler version of Gaussian clustering. By Observation 4.1 we obtain
that Spherical clustering is invariant with respect to scaling and isometric
transformations (however, it is obviously not invariant with respect to affine
transformations).

Observation 4.3. Let $(U_i)_{i=1}^k$ be a $\mu$-partition. Then

$$h_\mu(\mathcal{G}_{(\cdot I)}; (U_i)_{i=1}^k) = \sum_{i=1}^k \mu(U_i) \cdot \Big[ \tfrac{N}{2}\ln(2\pi e/N) - \ln(\mu(U_i)) + \tfrac{N}{2}\ln D^\mu_{U_i} \Big]. \qquad (4.4)$$

To implement the Hartigan approach to Spherical CEC and to deal with the
Spherical divergence, the following trivial consequence of Theorem 3.3 is useful.

Corollary 4.1. Let $U, V$ be measurable sets.

a) Assume additionally that $U \cap V = \emptyset$ and $\mu(U) > 0$, $\mu(V) > 0$. Then

$$D^\mu_{U \cup V} = p_U D^\mu_U + p_V D^\mu_V + p_U p_V \|m^\mu_U - m^\mu_V\|^2,$$

where $p_U := \frac{\mu(U)}{\mu(U)+\mu(V)}$, $p_V := \frac{\mu(V)}{\mu(U)+\mu(V)}$.

b) Assume that $V \subset U$ is such that $\mu(V) < \mu(U)$. Then

$$m^\mu_{U \setminus V} = q_U m^\mu_U - q_V m^\mu_V, \qquad D^\mu_{U \setminus V} = q_U D^\mu_U - q_V D^\mu_V - q_U q_V \|m^\mu_U - m^\mu_V\|^2,$$

where $q_U := \frac{\mu(U)}{\mu(U)-\mu(V)}$, $q_V := \frac{\mu(V)}{\mu(U)-\mu(V)}$.

(a) Three circles. (b) k-means with k = 3. (c) Spherical CEC.

Figure 5: Comparison of classical k-means clustering (with k = 3 clusters) with Spherical
CEC (with initially 10 clusters) of three "mouse-like" circles.

Example 4.2. We considered the uniform distribution on the set consisting
of three disjoint circles. We started CEC with an initial choice of 10 clusters; as
a result of Spherical CEC we obtained a clustering into three circles, see Figure
5(c). Compare this result with the classical k-means with $k = 3$ in Figure 5(b).
Observe that, contrary to classical k-means, in spherical clustering we
do not obtain the "mouse effect".

Let us now consider when we should join two groups. 

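Theorem 4.2. Let $U_1$ and $U_2$ be disjoint measurable sets with nonzero $\mu$-measure.
We put $p_i = \mu(U_i)/(\mu(U_1) + \mu(U_2))$ and $m_i = m^\mu_{U_i}$, $D_i = D^\mu_{U_i}$ for
$i = 1, 2$. Then

$$d_\mu(\mathcal{G}_{(\cdot I)}; (U_1, U_2))/(\mu(U_1) + \mu(U_2)) = \tfrac{N}{2}\ln\big(p_1 D_1 + p_2 D_2 + p_1 p_2 \|m_1 - m_2\|^2\big) - p_1 \tfrac{N}{2}\ln D_1 - p_2 \tfrac{N}{2}\ln D_2 - \mathrm{sh}(p_1) - \mathrm{sh}(p_2).$$

Consequently, $d_\mu(\mathcal{G}_{(\cdot I)}; (U_1, U_2)) > 0$ iff

$$\|m_1 - m_2\|^2 > \frac{1}{p_1 p_2}\left[ \frac{D_1^{p_1} D_2^{p_2}}{p_1^{2p_1/N}\, p_2^{2p_2/N}} - (p_1 D_1 + p_2 D_2) \right].$$

Proof. By (4.4)

$$d_\mu(\mathcal{G}_{(\cdot I)}; (U_1, U_2))/(\mu(U_1) + \mu(U_2)) = \tfrac{N}{2}\ln\big(D^\mu_{U_1 \cup U_2}\big) - p_1 \tfrac{N}{2}\ln D_1 - p_2 \tfrac{N}{2}\ln D_2 - \mathrm{sh}(p_1) - \mathrm{sh}(p_2).$$

Since by Corollary 4.1

$$D^\mu_{U_1 \cup U_2} = p_1 D_1 + p_2 D_2 + p_1 p_2 \|m_1 - m_2\|^2,$$

we obtain that $d_\mu(\mathcal{G}_{(\cdot I)}; (U_1, U_2)) > 0$ iff

$$p_1 D_1 + p_2 D_2 + p_1 p_2 \|m_1 - m_2\|^2 > \frac{D_1^{p_1} D_2^{p_2}}{p_1^{2p_1/N}\, p_2^{2p_2/N}}.$$

□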



Observation 4.4. Let us simplify the above formula in the case when we
have sets with identical measures $\mu(U_1) = \mu(U_2)$ and $D := D^\mu_{U_1} = D^\mu_{U_2}$. Then
by the previous theorem we should glue the groups together if

$$\|m_1 - m_2\| \le 2\sqrt{4^{1/N} - 1}\; r,$$

where $r = \sqrt{D}$. So, as we expected, when the distance between the groups is
proportional to their "radius" the joining becomes profitable.

Another, maybe less obvious, consequence of the above inequality
is that with the dimension $N$ growing we should join the groups/sets together
if their centers become closer. This follows from the observation that if we
choose two balls in $\mathbb{R}^N$ with radius $r$ and distance between centers $R > 2r$,
the proportion of their volumes to the volume of the containing ball decreases
to zero with dimension growing to infinity.
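The dependence of the spherical joining threshold on the dimension can be checked with a few lines of code. This is a sketch assuming $\mathrm{sh}(x) = -x\ln x$; `spherical_divergence` evaluates the normalized divergence of Theorem 4.2 and `equal_groups_join_ratio` is the factor $2\sqrt{4^{1/N} - 1}$ obtained from it for two groups of equal measure and equal $D$. Both names are hypothetical.

```python
import math


def sh(p: float) -> float:
    """Shannon function sh(p) = -p ln p."""
    return -p * math.log(p)


def spherical_divergence(p1: float, d1: float, d2: float,
                         dist_sq: float, dim: int) -> float:
    """Normalized spherical divergence of Theorem 4.2.

    d1, d2 are the values D_1, D_2 and dist_sq is ||m_1 - m_2||^2.
    """
    p2 = 1.0 - p1
    merged = p1 * d1 + p2 * d2 + p1 * p2 * dist_sq  # Corollary 4.1 a)
    return (0.5 * dim * math.log(merged)
            - 0.5 * dim * (p1 * math.log(d1) + p2 * math.log(d2))
            - sh(p1) - sh(p2))


def equal_groups_join_ratio(dim: int) -> float:
    """Largest ||m_1 - m_2|| / r for which joining two equal groups pays off."""
    return 2.0 * math.sqrt(4.0 ** (1.0 / dim) - 1.0)


for dim in (1, 2, 5, 20):
    print(dim, equal_groups_join_ratio(dim))
```

The printed ratio decreases with the dimension, which is the behaviour described in Observation 4.4; for `dim = 1` it equals $\sqrt{12}$, in agreement with the threshold (4.3) for equal weights.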

4.4. Fixed covariance

In this section we are going to discuss the simple case when we cluster
by $\mathcal{G}_\Sigma$, for a fixed $\Sigma$. By Observation 4.1 we obtain that $\mathcal{G}_\Sigma$ clustering is
translation invariant (however, it is obviously not invariant with respect to
scaling or isometric transformations).

Observation 4.5. Let $\Sigma$ be a fixed positive symmetric matrix, and let $(U_i)_{i=1}^k$
be a sequence of pairwise disjoint measurable sets. Then

$$h_\mu(\mathcal{G}_\Sigma; (U_i)_{i=1}^k) = \sum_{i=1}^k \mu(U_i) \cdot \big( \tfrac{N}{2}\ln(2\pi) + \tfrac{1}{2}\ln\det(\Sigma) \big) + \sum_{i=1}^k \mu(U_i) \cdot \big[ -\ln(\mu(U_i)) + \tfrac{1}{2}\mathrm{tr}(\Sigma^{-1}\Sigma^\mu_{U_i}) \big].$$

This implies that in the $\mathcal{G}_\Sigma$ clustering we search for the partition $(U_i)_{i=1}^k$
which minimizes

$$\sum_{i=1}^k \mu(U_i) \cdot \big[ -\ln(\mu(U_i)) + \tfrac{1}{2}\mathrm{tr}(\Sigma^{-1}\Sigma^\mu_{U_i}) \big].$$

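Now we show that in the $\mathcal{G}_\Sigma$ clustering, if we have two groups with
centers/means sufficiently close, it always pays to "glue" the groups together
into one.

Theorem 4.3. Let $U_1$ and $U_2$ be disjoint measurable sets with nonzero $\mu$-measure.
We put $p_i = \mu(U_i)/(\mu(U_1) + \mu(U_2))$. Then

$$d_\mu(\mathcal{G}_\Sigma; (U_1, U_2))/(\mu(U_1) + \mu(U_2)) = \tfrac{1}{2}\, p_1 p_2 \|m^\mu_{U_1} - m^\mu_{U_2}\|_\Sigma^2 - \mathrm{sh}(p_1) - \mathrm{sh}(p_2). \qquad (4.5)$$

Consequently $d_\mu(\mathcal{G}_\Sigma; (U_1, U_2)) > 0$ iff

$$\|m^\mu_{U_1} - m^\mu_{U_2}\|_\Sigma^2 > 2\,\frac{\mathrm{sh}(p_1) + \mathrm{sh}(p_2)}{p_1 p_2}.$$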

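Proof. We have

$$d_\mu(\mathcal{G}_\Sigma; (U_1, U_2))/(\mu(U_1) + \mu(U_2)) = \tfrac{1}{2}\mathrm{tr}(\Sigma^{-1}\Sigma^\mu_{U_1 \cup U_2}) - \tfrac{p_1}{2}\mathrm{tr}(\Sigma^{-1}\Sigma^\mu_{U_1}) - \tfrac{p_2}{2}\mathrm{tr}(\Sigma^{-1}\Sigma^\mu_{U_2}) - \mathrm{sh}(p_1) - \mathrm{sh}(p_2). \qquad (4.6)$$

Let $m = m^\mu_{U_1} - m^\mu_{U_2}$. Since $\Sigma^\mu_{U_1 \cup U_2} = p_1 \Sigma^\mu_{U_1} + p_2 \Sigma^\mu_{U_2} + p_1 p_2 m m^T$ and $\mathrm{tr}(AB) = \mathrm{tr}(BA)$,
(4.6) simplifies to (4.5). □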
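The simplification of (4.6) can be cross-checked on a small one-dimensional example. The sketch assumes $N = 1$, a fixed variance `sigma2` playing the role of $\Sigma$, the counting measure normalized by a hypothetical total number of points `total`, and $\mathrm{sh}(x) = -x\ln x$. The closed form used below is $\tfrac{1}{2} p_1 p_2 \|m\|_\Sigma^2 - \mathrm{sh}(p_1) - \mathrm{sh}(p_2)$, with the factor $\tfrac{1}{2}$ inherited from the trace terms of (4.6).

```python
import math
from typing import List


def sh(p: float) -> float:
    """Shannon function sh(p) = -p ln p."""
    return -p * math.log(p)


def fixed_cov_cost(values: List[float], total: int, sigma2: float) -> float:
    """Cost of one set under a fixed variance sigma2 (Observation 4.5, N = 1)."""
    n = len(values)
    weight = n / total
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    return weight * (0.5 * math.log(2 * math.pi) + 0.5 * math.log(sigma2)
                     - math.log(weight) + 0.5 * var / sigma2)


def divergence_direct(u1: List[float], u2: List[float],
                      total: int, sigma2: float) -> float:
    """(cost of the union minus cost of the split), normalized by the joint weight."""
    joint = fixed_cov_cost(u1 + u2, total, sigma2)
    split = fixed_cov_cost(u1, total, sigma2) + fixed_cov_cost(u2, total, sigma2)
    return (joint - split) / ((len(u1) + len(u2)) / total)


def divergence_closed_form(u1: List[float], u2: List[float], sigma2: float) -> float:
    """0.5 * p1 * p2 * (m1 - m2)^2 / sigma2 - sh(p1) - sh(p2)."""
    p1 = len(u1) / (len(u1) + len(u2))
    p2 = 1.0 - p1
    m1, m2 = sum(u1) / len(u1), sum(u2) / len(u2)
    return 0.5 * p1 * p2 * (m1 - m2) ** 2 / sigma2 - sh(p1) - sh(p2)
```

For instance, with `u1 = [0.0, 0.5, 1.5]`, `u2 = [3.0, 4.0]`, `total = 8` and `sigma2 = 0.7` the two functions agree up to rounding, and the within-group spreads cancel as stated after the proof.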

Observe that the above formula does not depend on the deviations within the groups,
but only on the distance between their centers of mass (the means of the groups).

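Lemma 4.1. The function

$$\{(p_1, p_2) \in (0,1)^2 : p_1 + p_2 = 1\} \ni (p_1, p_2) \mapsto \frac{\mathrm{sh}(p_1) + \mathrm{sh}(p_2)}{p_1 p_2}$$

attains its global minimum $\ln 16$ at $p_1 = p_2 = 1/2$.

Proof. Consider

$$w : (0,1) \ni p \mapsto \frac{\mathrm{sh}(p) + \mathrm{sh}(1-p)}{p(1-p)}.$$

Since $w$ is symmetric with respect to $1/2$, to show the assertion it is sufficient to
prove that $w$ is convex.
We have

$$w''(p) = \frac{2(-1+p)^3\ln(1-p) + p(-1 + p - 2p^2\ln(p))}{(1-p)^3 p^3}.$$

Since the denominator of $w''$ is positive on $(0,1)$, we consider only the numerator,
which we denote by $g(p)$. It is symmetric around $1/2$ and

$$g''(p) = 4(-2 + 3(-1+p)\ln(1-p) - 3p\ln(p)).$$

The fourth derivative of $g$ equals $-12/[p(1-p)] < 0$, so $g''$ is concave and
symmetric around $1/2$, with limit $-8$ at the endpoints and maximal value

$$g''(1/2) = 4(-2 + 3\ln 2) = 4\ln(8/e^2) > 0.$$

Hence there is $a \in (0, 1/2)$ such that $g'' < 0$ on $(0, a)$ and $g'' > 0$ on $(a, 1/2]$.
Since $g'(p) \to 1$ as $p \to 0$ and $g'(1/2) = 0$ by symmetry, $g'$ decreases on $(0, a)$
and increases to $0$ on $(a, 1/2)$; thus $g' > 0$ on $(0, b)$ and $g' < 0$ on $(b, 1/2)$ for some
$b \in (0, a)$. Therefore on $(0, 1/2]$ the function $g$ first increases from the limit value $0$ at $p \to 0$
and then decreases to $g(1/2) = \tfrac{1}{4}\ln(4/e) > 0$, which implies that $g$ is positive on
$(0, 1/2]$ and, by symmetry, on $(0,1)$. Consequently $w''$ is also positive. Therefore $w$ is
convex and symmetric around $1/2$, and
therefore attains its global minimum $4\ln 2$ at $p = 1/2$. □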

Corollary 4.2. If we have two clusters with centers $m_1$ and $m_2$, then it is
always profitable to glue them together into one group in $\mathcal{G}_\Sigma$-clustering if

$$\|m_1 - m_2\|_\Sigma < \sqrt{\ln 16} \approx 1.665.$$
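The constant in Corollary 4.2 comes from the minimum of the function of Lemma 4.1, which can be checked numerically. The sketch below assumes $\mathrm{sh}(x) = -x\ln x$ and scans a uniform grid on $(0, 1)$; the grid size is an arbitrary choice made for illustration.

```python
import math


def w(p: float) -> float:
    """(sh(p) + sh(1 - p)) / (p (1 - p)) with sh(x) = -x ln x."""
    q = 1.0 - p
    return (-p * math.log(p) - q * math.log(q)) / (p * q)


# Scan a uniform grid on (0, 1) and locate the minimizer.
grid = [i / 1000 for i in range(1, 1000)]
p_min = min(grid, key=w)

print(p_min, w(p_min), math.log(16), math.sqrt(math.log(16)))
```

The scan returns the grid point $1/2$ with value $4\ln 2 = \ln 16$, whose square root is the constant $1.665$ of the corollary.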

As a direct consequence we get: 

Corollary 4.3. Let $\mu$ be a measure with support contained in a bounded
convex set $V$. Then the number of clusters which realize the cross-entropy
$\mathcal{G}_\Sigma$ is bounded from above by the maximal cardinality of an $\varepsilon$-net (with respect
to the Mahalanobis distance $\|\cdot\|_\Sigma$), where $\varepsilon = \sqrt{4\ln 2}$, in $V$.

Proof. By $k$ we denote the maximal cardinality of the $\varepsilon$-net with respect to
the Mahalanobis distance.

Consider an arbitrary $\mu$-partition $(U_i)_{i=1}^l$ consisting of sets with nonzero
$\mu$-measure. Suppose that $l > k$. We are going to construct a $\mu$-partition with
$l - 1$ elements which has smaller cross-entropy than $(U_i)_{i=1}^l$.

To do so consider the set $(m^\mu_{U_i})_{i=1}^l$ consisting of the centers of the sets $U_i$.
By the assumptions we know that there exist at least two centers which are
closer than $\varepsilon$; for simplicity assume that $\|m^\mu_{U_{l-1}} - m^\mu_{U_l}\|_\Sigma < \varepsilon$. Then by the
previous results we obtain that

$$h_\mu(\mathcal{G}_\Sigma; U_{l-1} \cup U_l) < h_\mu(\mathcal{G}_\Sigma; U_{l-1}) + h_\mu(\mathcal{G}_\Sigma; U_l).$$

This implies that the $\mu$-partition $(U_1, \ldots, U_{l-2}, U_{l-1} \cup U_l)$ has smaller
cross-entropy than $(U_i)_{i=1}^l$. □

4.5. Spherical CEC with scale and k-means

We recall that $\mathcal{G}_{sI}$ denotes the set of all normal densities with covariance
$sI$. We are going to show that for $s \to 0$ the results of $\mathcal{G}_{sI}$-CEC converge to
k-means clustering, while for $s \to \infty$ our data will form one big group.

Observation 4.6. For the sequence $(U_i)_{i=1}^k$ we get

$$h_\mu(\mathcal{G}_{sI}; (U_i)_{i=1}^k) = \sum_{i=1}^k \mu(U_i) \cdot \Big( \tfrac{N}{2}\ln(2\pi s) - \ln\mu(U_i) + \tfrac{1}{2s} D^\mu_{U_i} \Big).$$

Clearly by Observation 4.1 $\mathcal{G}_{sI}$ clustering is isometry invariant; however,
it is not scale invariant.

To compare k-means with Spherical CEC with fixed scale let us first
describe classical k-means from our point of view. Let $\mu$ denote a discrete
or continuous probability measure. For a $\mu$-partition $(U_i)_{i=1}^k$ we introduce
the within clusters sum of squares by the formula

$$\mathrm{ss}(\mu \| (U_i)_{i=1}^k) := \sum_{i=1}^k \int_{U_i} \|x - m^\mu_{U_i}\|^2 \, d\mu(x) = \sum_{i=1}^k \mu(U_i) \int \|x - m^\mu_{U_i}\|^2 \, d\mu_{U_i}(x) = \sum_{i=1}^k \mu(U_i) D^\mu_{U_i}.$$

Remark 4.3. Observe that if we have data $Y$ partitioned into $Y = Y_1 \cup \ldots \cup Y_k$,
then the above coincides (modulo multiplication by the cardinality of $Y$)
with the classical within clusters sum of squares. Namely, for the
discrete probability measure $\mu_Y := \frac{1}{\mathrm{card}(Y)} \sum_{y \in Y} \delta_y$ we have

$$\mathrm{ss}(\mu_Y \| (Y_i)_{i=1}^k) = \frac{1}{\mathrm{card}(Y)} \sum_{i=1}^k \sum_{y \in Y_i} \|y - m_{Y_i}\|^2.$$

(a) k-means with k = 5. (b) $\mathcal{G}_{sI}$-CEC for very small $s$ and 5 clusters.

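In classical k-means the aim is to find a $\mu$-partition $(U_i)_{i=1}^k$ which
minimizes the within clusters sum of squares

$$\sum_{i=1}^k \mu(U_i) D^\mu_{U_i}, \qquad (4.7)$$

while in $\mathcal{G}_{sI}$-clustering our aim is to minimize the cost from Observation 4.6,
which, after dropping the term $\tfrac{N}{2}\ln(2\pi s)$ (constant for a $\mu$-partition) and
multiplying by $2s$, becomes

$$\sum_{i=1}^k \mu(U_i) D^\mu_{U_i} + 2s \sum_{i=1}^k \mathrm{sh}(\mu(U_i)).$$

Obviously with $s \to 0$ the above function converges to (4.7), which implies
that k-means clustering can be understood as the limiting case of $\mathcal{G}_{sI}$
clustering, with $s \to 0$.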
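The limiting behaviour can be observed numerically on a toy partition. The sketch assumes the uniform measure on a finite data set given as a list of clusters, and uses the cost of Observation 4.6; `rescaled_cost` (a hypothetical name) multiplies that cost by $2s$ and removes the term $sN\ln(2\pi s)$, so that what remains is the within clusters sum of squares plus $2s\sum_i \mathrm{sh}(\mu(U_i))$.

```python
import math
from typing import List, Sequence

Cluster = List[Sequence[float]]


def mean_sq_dev(cluster: Cluster) -> float:
    """D_U: mean squared distance of the cluster points to their mean."""
    n, dim = len(cluster), len(cluster[0])
    mean = [sum(p[j] for p in cluster) / n for j in range(dim)]
    return sum(sum((p[j] - mean[j]) ** 2 for j in range(dim)) for p in cluster) / n


def within_ss(clusters: List[Cluster]) -> float:
    """Within clusters sum of squares, normalized by the total number of points."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * mean_sq_dev(c) for c in clusters)


def gsi_cost(clusters: List[Cluster], s: float) -> float:
    """Cost of Observation 4.6 for the normal densities with covariance s*I."""
    total = sum(len(c) for c in clusters)
    dim = len(clusters[0][0])
    cost = 0.0
    for c in clusters:
        weight = len(c) / total
        cost += weight * (0.5 * dim * math.log(2 * math.pi * s)
                          - math.log(weight) + mean_sq_dev(c) / (2 * s))
    return cost


def rescaled_cost(clusters: List[Cluster], s: float) -> float:
    """2s * cost - s*N*ln(2*pi*s), i.e. ss plus 2s times the entropy of the weights."""
    dim = len(clusters[0][0])
    return 2 * s * gsi_cost(clusters, s) - s * dim * math.log(2 * math.pi * s)
```

For a fixed partition into $k$ sets the gap `rescaled_cost(clusters, s) - within_ss(clusters)` lies between $0$ and $2s\ln k$, so it vanishes as $s \to 0$.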



Example 4.3. In Figure 4.3 we compare $\mathcal{G}_{sI}$ clustering of the square $[0, 1]^2$
with very small $s$ to k-means. As we see, we obtain visually identical
results.

Observation 4.7. We have

$$-2s\ln k \le \mathrm{ss}(\mu \| (U_i)_{i=1}^k) - \Big[ -sN\ln(2\pi s) + 2s\, H^\times\Big(\mu \,\Big\|\, \bigcup_{i=1}^k (\mathcal{G}_{sI}|U_i)\Big) \Big] \le 0.$$

This means that for an arbitrary partition consisting of $k$ sets, $\mathrm{ss}(\mu \| \cdot)$ can be
approximated (as $s \to 0$) with the affine combination of $H^\times(\mu \| \mathcal{G}_{sI})$, which can
be symbolically summarized as the interpretation of k-means as $\mathcal{G}_{0 \cdot I}$ clustering.

If we cluster with $s \to \infty$ we have a tendency to build larger and larger
clusters.

Proposition 4.1. Let $\mu$ be a measure with support of diameter $d$. Then for

$$s > \frac{d^2}{\ln 16}$$

the optimal clustering with respect to $\mathcal{G}_{sI}$ will be obtained for one large group.

More precisely, for every $k > 1$ and $\mu$-partition $(U_i)_{i=1}^k$ consisting of sets
of nonzero $\mu$-measure we have

$$H^\times(\mu \| \mathcal{G}_{sI}) < H^\times\Big(\mu \,\Big\|\, \bigcup_{i=1}^k (\mathcal{G}_{sI}|U_i)\Big).$$

Proof. By applying Corollary 4.2 with $\Sigma = sI$ we obtain that we should
always glue two groups with centers $m_1, m_2$ together if $\|m_1 - m_2\|^2_{sI} < \ln 16$,
or equivalently if $\|m_1 - m_2\|^2 < s\ln 16$. □

Concluding, if the radius tends to zero, we cluster the data into smaller
and smaller groups, while for the radius going to $\infty$, the data will have the
tendency to form only one group.